Background Story
Gradient descent is not efficient in variational inference because probability distributions do not naturally live in Euclidean space but rather on a statistical manifold. There are better ways of defining the distance between distributions; one of the simplest is the symmetrized Kullback-Leibler divergence:

$$\mathrm{KL}_{\mathrm{sym}}(p_1, p_2) = \tfrac{1}{2}\bigl(\mathrm{KL}(p_1 \,\|\, p_2) + \mathrm{KL}(p_2 \,\|\, p_1)\bigr)$$
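A minimal numerical sketch, assuming two univariate Gaussians so that both directed KL terms have the standard closed form:

```python
import numpy as np

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) for univariate Gaussians."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2.0 * sigma2**2)
            - 0.5)

def kl_sym(mu1, sigma1, mu2, sigma2):
    """Symmetrized KL: average of the two directed divergences."""
    return 0.5 * (kl_gauss(mu1, sigma1, mu2, sigma2)
                  + kl_gauss(mu2, sigma2, mu1, sigma1))

print(kl_sym(0.0, 1.0, 1.0, 2.0))
print(kl_sym(1.0, 2.0, 0.0, 1.0))   # same value: the symmetrized divergence is symmetric
```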
In differential geometry, the squared length of an infinitesimal displacement $d\phi$ on a manifold is given by the bilinear form

$$\|d\phi\|^2 = \langle d\phi,\, G(\phi)\, d\phi \rangle = \sum_{i,j} g_{ij}(\phi)\, d\phi_i\, d\phi_j.$$

The matrix $G(\phi) = [g_{ij}(\phi)]$ is called the Riemannian metric tensor.
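As a small worked case: for a two-parameter family with a diagonal metric $G(\phi) = \mathrm{diag}\bigl(g_{11}(\phi),\, g_{22}(\phi)\bigr)$, the squared length of a step $d\phi = (d\phi_1, d\phi_2)$ is

$$\|d\phi\|^2 = g_{11}(\phi)\, d\phi_1^2 + g_{22}(\phi)\, d\phi_2^2,$$

so the same coordinate displacement can have very different lengths at different points $\phi$.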
In Euclidean space with an orthonormal basis, $G(\phi)$ is simply the identity matrix. When $\Phi$ is a space of parameters of probability distributions and the symmetrized KL divergence is used to measure the distance between distributions, $G(\phi)$ turns out to be the Fisher information matrix:

$$G(\phi)_{i,j} = \mathbb{E}\!\left[\left(\frac{\partial}{\partial \phi_i} \log f(X;\phi)\right)\left(\frac{\partial}{\partial \phi_j} \log f(X;\phi)\right) \,\middle|\, \phi \right].$$
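For a concrete check, consider a univariate Gaussian $f(x;\phi) = \mathcal{N}(x;\mu,\sigma^2)$ with $\phi = (\mu, \sigma)$, whose Fisher information matrix is known in closed form to be $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$. A minimal NumPy sketch that estimates the expectation above by Monte Carlo over the score outer products and compares it to the closed form:

```python
import numpy as np

def score(x, mu, sigma):
    """Score of N(mu, sigma^2): gradient of log f(x; mu, sigma) w.r.t. (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = (x - mu)**2 / sigma**3 - 1.0 / sigma
    return np.stack([d_mu, d_sigma], axis=-1)

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
x = rng.normal(mu, sigma, size=200_000)

s = score(x, mu, sigma)            # shape (N, 2): per-sample score vectors
fisher_mc = s.T @ s / len(x)       # Monte Carlo estimate of E[score score^T]
fisher_exact = np.diag([1.0 / sigma**2, 2.0 / sigma**2])

print(np.round(fisher_mc, 3))
print(np.round(fisher_exact, 3))
```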
The Story
In gradient ascent (of the evidence lower bound in variational inference), we want to maximize

$$\mathcal{L}(\phi + \epsilon v) \approx \mathcal{L}(\phi) + \epsilon\, \nabla \mathcal{L}(\phi)^{T} v$$

subject to the constraint

$$\|v\|^2 = \langle v,\, G(\phi)\, v \rangle = 1.$$

Solving with Lagrange multipliers shows that the optimal direction $v$ is proportional to the inverse of the Fisher information matrix times the ordinary gradient, which is the natural gradient:

$$G(\phi)^{-1}\, \nabla \mathcal{L}(\phi)$$
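As a minimal sketch of the resulting update rule $\phi \leftarrow \phi + \epsilon\, G(\phi)^{-1} \nabla \mathcal{L}(\phi)$, assuming a univariate Gaussian likelihood with $\phi = (\mu, \sigma)$ and its closed-form Fisher matrix $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$ as the metric, natural-gradient ascent on the average log-likelihood of a data set looks like:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 0.5, size=1_000)   # observed data
mu, sigma = 0.0, 2.0                   # initial parameters phi = (mu, sigma)
eps = 0.1                              # step size

for _ in range(200):
    # Ordinary gradient of the average log-likelihood w.r.t. (mu, sigma).
    grad = np.array([
        np.mean(x - mu) / sigma**2,
        np.mean((x - mu)**2) / sigma**3 - 1.0 / sigma,
    ])
    # Fisher information matrix of N(mu, sigma^2) in (mu, sigma) coordinates.
    G = np.diag([1.0 / sigma**2, 2.0 / sigma**2])
    # Natural gradient: G(phi)^{-1} grad L(phi).
    nat_grad = np.linalg.solve(G, grad)
    mu = mu + eps * nat_grad[0]
    sigma = sigma + eps * nat_grad[1]

print(mu, sigma)            # approaches the sample mean and (ML) standard deviation
print(x.mean(), x.std())
```

Note that $G(\phi)^{-1}$ rescales the $\mu$-component of the step to $\epsilon(\bar{x} - \mu)$, independent of the current $\sigma$; that insensitivity to how the parameters are scaled is the practical payoff of the natural gradient.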
Reference
The Natural Gradient: https://hips.seas.harvard.edu/blog/2013/01/25/the-natural-gradient/